Measuring the effective complexity of cosmological models
We introduce a statistical measure of the effective model complexity, called the Bayesian
complexity. We demonstrate that the Bayesian complexity can be used to assess how many effective
parameters a set of data can support and that it is a useful complement to the model likelihood (the 
evidence) in model selection questions. We apply this approach to recent measurements of cosmic 
microwave background anisotropies combined with the Hubble Space Telescope measurement of the 
Hubble parameter. Using mildly non-informative priors, we show how the 3-year WMAP data im- 
proves on the first-year data by being able to measure both the spectral index and the reionization 
epoch at the same time. We also find that a non-zero curvature is strongly disfavored. We conclude 
that although current data could constrain at least seven effective parameters, only six of them are 
required in a scheme based on the ΛCDM concordance cosmology.

PACS numbers: 98.80.Es, 02.50.-r, 98.70.Vc

I. INTRODUCTION



The quest for a cosmological standard model is being 
driven by an increasing amount of high quality obser- 
vations. An important and natural question concerns 
the number of fundamental parameters of the underly- 
ing physical model. How many numbers are necessary to 
characterize the Universe? Or in other words, how com- 
plex is the Universe? Generally the term "complexity" 
is employed in a rather loose fashion: a more complex 
model is one with a larger number of parameters that 
can be adjusted over a large range to fit the model to the 
observations. In this paper we will try to measure the 
effective number of model parameters that a given set of 
data can support. Because of the connection to the data, 
we will call this the effective complexity or Bayesian com- 
plexity. Our main purpose is to present a statistically 
sound quantity that embodies in a quantitative way the 
above notion of complexity when a model is compared to 
the observations. 

Bayesian model comparison makes use of an Occam's 
razor argument to rank models in terms of their quality
of fit and economy in the number of free parameters. A 
model with more free parameters will naturally fit the 
observations better, but it will also be penalized for the 
wasted parameter space that the larger number of param- 
eters implies. Several studies have made use of Bayesian 
model comparison to assess the viability of different models
in the cosmological context [1, 2]. In this work we
show that Bayesian complexity is an ideal complement to
model selection in that it allows one to identify the number
of effective parameters supported by the data.

We start by introducing our notation and the funda- 
mentals of Bayesian statistics and model comparison. We 
then present the Bayesian complexity and illustrate its 
use in the context of a toy model in section III. In section
IV we apply it to observations of cosmic microwave
background anisotropies and we conclude in section V.



II. BAYESIAN MODEL SELECTION AND 
COMPLEXITY 

A. Model comparison 

We first briefly review the basic ingredients of Bayesian 
statistics and some relevant aspects of information the- 
ory. This serves both to introduce our notation and to 
remind the reader of the main points. We will use a fairly 
compact notation where possible and refer the reader to 
e.g. [?] for the exact mathematical definitions. Specifically,
for an outcome $x$ of a random variable $X$ we will
write $p(x)$ for the probability distribution function (pdf),
i.e. the probability that $X$ takes a certain value $x$. In
the case of a multi-dimensional parameter space we will
write $p(x)$ as a short form of the joint probability over
all components of $x$, $p(x_1, x_2, \ldots, x_n)$. The conditional
probability of $x$ given $y$ is written $p(x|y)$.

The starting point of our analysis is Bayes theorem:

$$ p(x|y,I) = \frac{p(y|x,I)\,p(x|I)}{p(y|I)}. \tag{1} $$

Here, the quantity $I$ represents a collection of all external
hypotheses and our model assumptions.

Given the data $d$ and a model $\mathcal{M}$ with $n$ free parameters
$\theta$, statistical inference deals with the task of determining
a posterior pdf for the free parameters of the
model, $p(\theta|d,\mathcal{M})$. The latter is computed via Bayes theorem
as

$$ p(\theta|d,\mathcal{M}) = \frac{p(d|\theta,\mathcal{M})\,p(\theta|\mathcal{M})}{p(d|\mathcal{M})}. \tag{2} $$



On the right-hand side of this equation, $p(d|\theta,\mathcal{M})$ is the
probability of obtaining the observed data given the parameter
value $\theta$. Since the observed data are a fixed
quantity, we interpret $p(d|\theta,\mathcal{M})$ as a function of $\theta$ and
we call it the likelihood, $\mathcal{L}(\theta) = p(d|\theta,\mathcal{M})$. The prior
pdf $p(\theta|\mathcal{M})$ embodies our state of knowledge about the
values of the parameters of the model before we see the
data. There are two conceptually different approaches
to the definition of the prior: the first takes it to be the
outcome of previous observations (i.e., the posterior of a
previous experiment becomes the prior for the next), and
is useful when updating one's knowledge about the values
of the parameters from subsequent observations. For the
scope of this paper, it is more appropriate to interpret the
prior as the available parameter space under the model,
which then is collapsed to the posterior after arrival of
the data. Thus the prior constitutes an integral part of
our model specification. In order to avoid cluttering the
notation, we will write $\pi(\theta) \equiv p(\theta|\mathcal{M})$, with the model
dependence understood whenever no confusion is likely
to arise.

The expression in the denominator of (2) is a normalization
constant and can be computed by integrating over
the parameters,

$$ p(d|\mathcal{M}) = \int d\theta\, \pi(\theta)\, \mathcal{L}(\theta). \tag{3} $$



This corresponds to the average of the likelihood function
under the prior and it is the fundamental quantity
for model comparison. The quantity $p(d|\mathcal{M})$ is called
the marginal likelihood (because the model parameters have
been marginalized), or in recent papers in the cosmological
context, the evidence. In the following we shall refer
to it as the likelihood for the model. The posterior
probability of the model is then, using Bayes theorem
again,

$$ p(\mathcal{M}|d) = \frac{p(d|\mathcal{M})\,\pi(\mathcal{M})}{p(d)}, \tag{4} $$



where $\pi(\mathcal{M})$ is the prior for the model. The quantity in
the denominator on the right-hand side is just a normalization
factor depending on the data alone, which we can
ignore. When comparing two models, $\mathcal{M}_1$ versus $\mathcal{M}_2$,
one introduces the Bayes factor $B_{12}$, defined as the ratio
of the model likelihoods, $B_{12} \equiv p(d|\mathcal{M}_1)/p(d|\mathcal{M}_2)$, so that

$$ \frac{p(\mathcal{M}_1|d)}{p(\mathcal{M}_2|d)} = \frac{\pi(\mathcal{M}_1)}{\pi(\mathcal{M}_2)}\, B_{12}. \tag{5} $$



In other words, the prior odds of the models are updated
by the data through the Bayes factor. If we do not have
any special reason to prefer one model over the other before
we see the data, then $\pi(\mathcal{M}_1) = \pi(\mathcal{M}_2) = 1/2$, and
the posterior odds reduce to the Bayes factor. Alternatively,
one can interpret the Bayes factor as the factor
by which our relative belief in the two models is modified
once we have seen the data.
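
As a minimal numerical sketch of Eqs. (3) and (5), the snippet below integrates prior times likelihood on a one-dimensional grid for two models that differ only in their prior width, and forms the Bayes factor. The single measurement `x_obs`, the two prior widths and the helper names are illustrative assumptions, not quantities taken from the analysis in this paper.

```python
import numpy as np

def evidence(log_like, prior_pdf, grid):
    """Approximate p(d|M) = int dtheta pi(theta) L(theta), Eq. (3), on a 1-D grid."""
    integrand = prior_pdf(grid) * np.exp(log_like(grid))
    # Trapezoidal rule written out explicitly.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid)))

def posterior_odds(bayes_factor, prior_m1=0.5, prior_m2=0.5):
    """Posterior odds of model 1 versus model 2, Eq. (5)."""
    return bayes_factor * prior_m1 / prior_m2

def gaussian_prior(width):
    """Zero-mean Gaussian prior pdf of the given standard deviation."""
    return lambda theta: np.exp(-0.5 * (theta / width) ** 2) / (width * np.sqrt(2 * np.pi))

# Hypothetical data: one measurement with unit Gaussian error.
x_obs = 0.5
log_like = lambda theta: -0.5 * (x_obs - theta) ** 2 - 0.5 * np.log(2 * np.pi)

grid = np.linspace(-60.0, 60.0, 200001)
ev_narrow = evidence(log_like, gaussian_prior(1.0), grid)   # model 1: narrow prior
ev_wide = evidence(log_like, gaussian_prior(10.0), grid)    # model 2: wide prior
b12 = ev_narrow / ev_wide
odds = posterior_odds(b12)   # equal model priors: the odds equal the Bayes factor
```

In this toy setting the measurement lies well inside both priors, so the model with the narrower prior wastes less parameter space and obtains the larger model likelihood.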



B. A Bayesian measure of complexity 

The usefulness of a Bayesian model selection approach 
based on the model likelihood is that it tells us whether 
the increased "complexity" of a model with more param- 
eters is justified by the data. But the number of free 
parameters in a model is only the simplest possible no- 
tion of complexity, which we call input complexity and 
denote by $C_0$. A much more powerful definition of model
complexity was given by Spiegelhalter et al. [?], who introduced
the Bayesian complexity, which measures the
number of model parameters that the data can constrain:

$$ C_b = -2 \left( D_{KL}(p,\pi) - \widehat{D}_{KL} \right). \tag{6} $$



On the right-hand side, $D_{KL}(p,\pi)$ is the Kullback-Leibler
(KL) divergence between the posterior and the
prior, representing the information gain obtained when
upgrading the prior to the posterior via Bayes theorem:

$$ D_{KL}(p,\pi) = \int p(\theta|d,\mathcal{M}) \log\frac{p(\theta|d,\mathcal{M})}{\pi(\theta)}\, d\theta. \tag{7} $$



The KL divergence measures the relative entropy between
the two distributions (see e.g. [?]). In Eq. (6),
$\widehat{D}_{KL}$ is a point-estimate for the KL divergence. If all
parameters are well measured, then the posterior pdf
will collapse into a small region around $\hat\theta$, and thus the
KL divergence will approximately be given by $\widehat{D}_{KL} =
\log [p(\hat\theta|d,\mathcal{M})/\pi(\hat\theta)]$. By taking the difference, we compare the
effective information gain to the maximum information
gain we can expect under the model, $\widehat{D}_{KL}$. The factor
of 2 is chosen so that $C_b \to C_0$ for highly informative
data, as shown below. As a point estimator for $\hat\theta$ we employ
the posterior mean, denoted by an overbar. Other
choices are possible, and we discuss this further below.
We can use Eq. (7) and Bayes theorem to rewrite (6)
as

$$ C_b = -2 \int p(\theta|d,\mathcal{M}) \log\mathcal{L}(\theta)\, d\theta + 2 \log\mathcal{L}(\hat\theta). \tag{8} $$

Defining an effective $\chi^2$ through the likelihood as $\mathcal{L}(\theta) \propto
\exp(-\chi^2/2)$ (any constant factors drop out of the difference
of the logarithms in Eq. (8)) we can write the
effective number of parameters as

$$ C_b = \overline{\chi^2(\theta)} - \chi^2(\hat\theta), \tag{9} $$



where the mean is taken over the posterior pdf. This
quantity can be computed fairly easily from a Markov
chain Monte Carlo (MCMC) run, which is nowadays
widely used to perform the parameter inference step of
the analysis.
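
The estimator of Eq. (9) can be sketched in a few lines, assuming the chain is available as an array of equally weighted posterior samples together with a function returning the effective $\chi^2$. The function name `bayesian_complexity` and the Gaussian stand-in for an MCMC chain are assumptions made for illustration; a weighted chain would need weighted averages instead.

```python
import numpy as np

def bayesian_complexity(samples, chi2):
    """Eq. (9): posterior mean of chi^2 minus chi^2 at the posterior mean.

    samples : (N, n) array of equally weighted posterior samples
    chi2    : function mapping an (n,) parameter vector to -2 ln L (up to a constant)
    """
    mean_chi2 = np.mean([chi2(theta) for theta in samples])
    chi2_at_mean = chi2(samples.mean(axis=0))
    return float(mean_chi2 - chi2_at_mean)

# Hypothetical check: a 3-parameter Gaussian likelihood with a prior so wide
# that the posterior coincides with the likelihood, so all 3 parameters are measured.
rng = np.random.default_rng(0)
sigma = np.array([0.1, 0.5, 2.0])
chain = rng.normal(loc=1.0, scale=sigma, size=(50000, 3))   # stand-in for an MCMC chain
chi2 = lambda theta: float(np.sum(((theta - 1.0) / sigma) ** 2))
c_b = bayesian_complexity(chain, chi2)   # close to 3, up to sampling noise
```

Because only a difference of $\chi^2$ values enters, any additive constant in the $\chi^2$ function is irrelevant.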

We thus see that the effective number of parameters of 
the model is not an absolute quantity, but rather a mea- 
sure of the constraining power of the data as compared 
to the predictivity of the model, i.e. the prior. Hence 
$C_b$ depends both on the data at hand and on the prior
available parameter space. In fact, it is clear that the
very notion of "well measured parameter" is not absolute,
but it depends on what our expectations are, i.e.
on the prior. For example, consider a measurement of
$\Omega_{\rm tot}$, the total energy density of the Universe, expressed
in units of the critical density. The current posterior uncertainty
around $\Omega_{\rm tot} = 1$ is about 0.02. Whether this
means that we have "measured" the Universe to be flat
(i.e. $\Omega_{\rm tot} = 1$) or not depends on the prediction of the
model we consider. If we take a generic prior in the range
$0 < \Omega_{\rm tot} < 2$, then we conclude that current data have
measured the Universe to be flat with moderate odds
(a precise analysis gives odds of 18 : 1 in favor of the
flat model, see [2]). On the contrary, in the framework
of e.g. landscape theories, the prior range of the model
is much narrower, say $|\Omega_{\rm tot} - 1| \lesssim 10^{-5}$, and therefore
current posterior knowledge is insufficient to deem the
parameter measured.

Eq. (6) is conceptually related to the $\Delta\chi^2$ approach
used for example in [?] to analyze how many dark energy
parameters are measured by the data. In that work
the decrease of the $\chi^2$ value within the confidence regions
encompassing 68% and 95% of the posterior was
measured and compared against the theoretical decrease
of a multivariate Gaussian distribution with a given number
of degrees of freedom (see e.g. chapter 15.6 of [?]).
This in turn allowed one to deduce the effective number of
degrees of freedom of the $\chi^2$.

We now turn to an explicit demonstration of the power 
and usefulness of coupling the model likelihood with the 
Bayesian model complexity in model selection questions, 
by analyzing in some detail linear toy models. 



III. LINEAR MODELS

Before applying the Bayesian complexity and model
likelihood to current cosmological data, we compute them
explicitly for a linear model and we illustrate their use in
a toy example involving fitting a polynomial of unknown
degree. We show below how the Bayesian complexity
tells us how many parameters the data could in principle
constrain given the prior expectations under the model.

A. Specification of the likelihood function

Let us consider the following linear model

$$ y = F\theta + \epsilon, \tag{10} $$

where the dependent variable $y$ is a $d$-dimensional vector
of observations, $\theta$ is a vector of dimension $n$ of unknown
regression coefficients and $F$ is a $d \times n$ matrix
of known constants that specify the relation between the
input variables $\theta$ and the dependent variables $y$ [15]. Furthermore,
$\epsilon$ is a $d$-dimensional vector of random variables
with zero mean (the noise). If we assume that $\epsilon$ follows a
multivariate Gaussian distribution with uncorrelated covariance
matrix $C = \mathrm{diag}(\tau_1^2, \tau_2^2, \ldots, \tau_d^2)$, then the likelihood
function takes the form

$$ p(y|\theta) = \frac{1}{(2\pi)^{d/2} \prod_j \tau_j} \exp\left[ -\frac{1}{2} (b - A\theta)^t (b - A\theta) \right], \tag{11} $$

where we have defined $A_{ij} = F_{ij}/\tau_i$ and $b_i = y_i/\tau_i$. This
can be cast in the form

$$ p(y|\theta) = \mathcal{L}_0 \exp\left[ -\frac{1}{2} (\theta - \theta_0)^t L (\theta - \theta_0) \right], \tag{12} $$

with the likelihood Fisher matrix $L$ given by

$$ L = A^t A \tag{13} $$

and a normalization constant

$$ \mathcal{L}_0 = \frac{1}{(2\pi)^{d/2} \prod_j \tau_j} \exp\left[ -\frac{1}{2} (b - A\theta_0)^t (b - A\theta_0) \right]. \tag{14} $$

Here $\theta_0$ denotes the parameter value that maximizes the
likelihood, given by

$$ \theta_0 = L^{-1} A^t b. \tag{15} $$

As a shortcut, we will write

$$ \chi^2(\theta) \equiv -2 \ln p(y|\theta) = \chi^2(\theta_0) + (\theta - \theta_0)^t L (\theta - \theta_0), \tag{16} $$

where $\chi^2(\theta_0) = -2 \ln \mathcal{L}_0$.

B. Model likelihood and complexity

In this section we gain some intuitive feeling about
the functional dependence of the model likelihood and
complexity on the prior and posterior for the simple case
of linear models outlined above. The results are then
applied in section III D to an explicit toy model, showing
the model likelihood and complexity in action.
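
The algebra of the linear model (rescaling by the noise, the Fisher matrix $L = A^t A$, the maximum-likelihood point $\theta_0 = L^{-1} A^t b$ and the quadratic expansion of the $\chi^2$ about $\theta_0$) can be checked numerically. The sketch below uses randomly generated, purely hypothetical values for the design matrix, the noise levels and the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 3                                  # hypothetical numbers of data points and parameters
F = rng.normal(size=(d, n))                  # known design matrix
tau = rng.uniform(0.5, 2.0, size=d)          # per-point noise standard deviations
theta_true = np.array([1.0, -2.0, 0.5])
y = F @ theta_true + tau * rng.normal(size=d)

A = F / tau[:, None]                         # A_ij = F_ij / tau_i
b = y / tau                                  # b_i = y_i / tau_i
L = A.T @ A                                  # likelihood Fisher matrix
theta0 = np.linalg.solve(L, A.T @ b)         # maximum-likelihood point

def chi2(theta):
    """-2 ln p(y|theta), including the Gaussian normalisation."""
    r = b - A @ theta
    return r @ r + d * np.log(2 * np.pi) + 2 * np.sum(np.log(tau))

theta = rng.normal(size=n)                   # arbitrary test point
lhs = chi2(theta)
rhs = chi2(theta0) + (theta - theta0) @ L @ (theta - theta0)
```

The two sides agree because the cross term vanishes at $\theta_0$, where $A^t (b - A\theta_0) = 0$.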

Assuming as a prior pdf a multinormal Gaussian distri- 
bution with zero mean and Fisher information matrix P 
(we remind the reader that the Fisher information matrix 
is the inverse of the covariance matrix), i.e. 



$$ \pi(\theta) = \frac{|P|^{1/2}}{(2\pi)^{n/2}} \exp\left[ -\frac{1}{2} \theta^t P \theta \right], \tag{17} $$

where $|P|$ denotes the determinant of the matrix $P$, the
model likelihood (3) and model complexity (6) of the
linear model above are given by Eqs. (A3) and (A9), respectively
(see Appendix A).

Let us now consider the explicit illustration of a model
with $n$ parameters, $\theta = (\theta_1, \ldots, \theta_n)$ and $C_0 = n$. Without
losing generality, we can always choose the units so that
the prior Fisher matrix is the identity matrix, i.e.

$$ P = \mathbb{1}_n. \tag{18} $$



This choice of units is natural since it is the prior that sets
the scale of the problem. The likelihood Fisher matrix
being a symmetric and positive matrix, it is characterized
by $n(n+1)/2$ real numbers, which we choose to be its
eigenvalues $1/\sigma_i^2$, $i = 1, \ldots, n$, and the elements of the
orthogonal matrix $U$ that diagonalizes it (corresponding
to rotation angles). Here the $\sigma_i$ represent the standard
deviations of the likelihood covariance matrix along its
eigendirections $i$, expressed in units of the prior width.
With $D = \mathrm{diag}(\sigma_1^{-2}, \ldots, \sigma_n^{-2})$ we have that the likelihood
Fisher matrix is given by $L = U D U^t$ and thus Eq. (A9)
gives

$$ C_b = C_0 - \mathrm{tr}\left[ (U D U^t + \mathbb{1}_n)^{-1} \right] = C_0 - \mathrm{tr}\left[ (D + \mathbb{1}_n)^{-1} \right] = \sum_{i=1}^{n} \frac{1}{1 + \sigma_i^2}. \tag{19} $$



The complexity only depends on how well we have measured
the eigendirections of the likelihood function with
respect to the prior. Every well-measured direction (i.e.,
one for which $\sigma_i \ll 1$) counts for one parameter in $C_b$,
while directions whose posterior is dominated by the
prior, $\sigma_i \gg 1$, do not count towards the effective complexity.
This automatically takes into account strong degeneracies
among parameters. Notice also that once an
eigendirection is well measured (i.e. in the limit $\sigma_i \ll 1$),
then the prior width does not matter anymore.
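
A small numerical illustration of this counting, assuming a unit prior Fisher matrix: the sum of $1/(1+\sigma_i^2)$ over eigendirections is compared with the trace expression $n - \mathrm{tr}[(L + \mathbb{1}_n)^{-1}]$ evaluated for a likelihood Fisher matrix built from the same $\sigma_i$ and a random rotation. The particular widths and the function names are hypothetical.

```python
import numpy as np

def complexity_from_sigmas(sigmas):
    """Each eigendirection contributes 1 / (1 + sigma_i^2) to C_b."""
    sigmas = np.asarray(sigmas, dtype=float)
    return float(np.sum(1.0 / (1.0 + sigmas ** 2)))

def complexity_from_fisher(L):
    """C_b = n - tr[(L + 1)^{-1}] for a unit prior Fisher matrix."""
    n = L.shape[0]
    return float(n - np.trace(np.linalg.inv(L + np.eye(n))))

# Hypothetical likelihood widths in units of the prior width:
# two well-measured directions, one marginal, one prior dominated.
sigmas = np.array([0.01, 0.1, 1.0, 10.0])
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
L = U @ np.diag(sigmas ** -2.0) @ U.T          # likelihood Fisher matrix

c_eig = complexity_from_sigmas(sigmas)
c_tr = complexity_from_fisher(L)
```

Here the two well-measured directions contribute almost one parameter each, the marginal direction contributes one half, and the prior-dominated direction contributes about 0.01, independently of the rotation $U$.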

In contrast, the model likelihood (A3) is given by (assuming
for simplicity that the mean of the likelihood corresponds
to the prior mean, i.e. $\theta_0 = 0$)

$$ p(d|\mathcal{M}) = \mathcal{L}_0 \prod_{i=1}^{n} \frac{\sigma_i}{\sqrt{1 + \sigma_i^2}}. \tag{20} $$
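
For a Gaussian likelihood centred on the mean of a unit Gaussian prior, the evidence integral (3) gives the best-fit likelihood times a product of factors $\sigma_i/\sqrt{1+\sigma_i^2}$, one per eigendirection. The following sketch checks the one-dimensional factor by brute-force integration; the grid settings and function names are illustrative assumptions.

```python
import numpy as np

def occam_factor(sigmas):
    """Product over eigendirections of sigma_i / sqrt(1 + sigma_i^2)."""
    sigmas = np.asarray(sigmas, dtype=float)
    return float(np.prod(sigmas / np.sqrt(1.0 + sigmas ** 2)))

def occam_factor_numeric(sigma, half_width=12.0, num=400001):
    """Integrate pi(theta) * L(theta) / L_0 in one dimension with theta_0 = 0."""
    theta = np.linspace(-half_width, half_width, num)
    prior = np.exp(-0.5 * theta ** 2) / np.sqrt(2 * np.pi)   # unit Gaussian prior
    like_ratio = np.exp(-0.5 * (theta / sigma) ** 2)         # L(theta) / L_0
    f = prior * like_ratio
    # Trapezoidal rule written out explicitly.
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(theta)))

checks = {s: (occam_factor([s]), occam_factor_numeric(s)) for s in (0.05, 1.0, 5.0)}
```

A well-measured direction ($\sigma_i \ll 1$) is penalized by a factor of about $\sigma_i$, the ratio of posterior to prior width, while an unmeasured direction ($\sigma_i \gg 1$) leaves the model likelihood essentially unchanged.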



Finally, we remark that an important ingredient of the 
Bayesian complexity is the point estimator for the KL 
divergence. Here we adopt the posterior mean as an es- 
timate, but other simple alternatives are certainly pos- 
sible, for instance the posterior peak (or mode), or the 
posterior median. The choice of an optimal estimator is 
still a matter of research (see e.g. section IV and the comments
at the end of Ref. [2]). The important aspect is
that the posterior pdf should be summarized by only one 
number, namely the value plugged into the KL estima- 
tor. This is obviously going to be a very bad description 
for highly complex pdf's, exhibiting for instance long, 
banana-shaped degeneracies. No single number can be 
expected to summarize accurately such a pdf. On the 
other hand, for fairly Gaussian pdf's all the different es- 
timators (mean, median and peak) reduce to the same 
quantity. This clearly calls for using normal directions in
parameter space [12], which make the posterior as Gaussian
as possible, a procedure that it would be wise to
follow whenever possible for many other good reasons 
(e.g., better and faster MCMC convergence). 



C. Effective complexity as a data diagnostics 

We now turn to the question of how we can use the 
model likelihood and complexity together as a tool for 
model selection. We shall see that the Bayesian com- 
plexity provides a way to assess the constraining power 
of the data with respect to the model at hand and to 
break the degeneracy among models with approximately 
equal model likelihood. 

Let us consider two models A and B with different
numbers of parameters, $n_B > n_A$ (but in general the
two models need not be nested). If the additional
parameters of model B are required by the data, then
the likelihood of model B will be larger and B will have
larger posterior odds, thus it should be ranked higher
in our preference than model A. However, even if the
extra parameters of model B are not strictly necessary,
they can lead to over-fitting of the data and compensate
the Occam's penalty factor in (20) sufficiently to lead to
a comparable marginal likelihood for both models. The
effective complexity provides a way to break the degeneracy
between the quality-of-fit term ($\mathcal{L}_0$) and the Occam's
razor factor in the marginal likelihood, and enables us to
establish whether the data is good enough to support the
introduction of the additional parameters of model B.

To summarize, we are confronted with the following 
scenarios: 

1. $p(d|B) \gg p(d|A)$: model B is clearly favored over
model A and the increased number of parameters
is justified by the data.

2. $p(d|B) \approx p(d|A)$ and $C_b(B) > C_b(A)$: the quality of
the data is sufficient to measure the additional parameters
of the more complicated model, but they
do not improve its likelihood by much. We should
prefer model A, with fewer parameters.

3. $p(d|B) \approx p(d|A)$ and $C_b(B) \approx C_b(A)$: both models
have a comparable likelihood and the effective
number of parameters is about the same. In this
case the data is not good enough to measure the additional
parameters of the more complicated model
and we cannot draw any conclusions as to whether
the additional complexity is warranted.
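
The three scenarios can be written as a small decision helper. This is only a sketch: the thresholds `ln_ev_tol` and `cb_tol`, which decide when two model likelihoods or two complexities count as "about the same", are illustrative assumptions and not values given in the text, and the final branch (B clearly worse than A) is added for completeness.

```python
def compare_models(ln_ev_a, ln_ev_b, cb_a, cb_b, ln_ev_tol=1.0, cb_tol=0.5):
    """Classify the comparison of model A with a more complex model B.

    ln_ev_* are log model likelihoods, cb_* are Bayesian complexities.
    ln_ev_tol and cb_tol are illustrative thresholds, not values from the text.
    """
    d_ln_ev = ln_ev_b - ln_ev_a
    if d_ln_ev > ln_ev_tol:
        return "case 1: B favored, extra parameters justified"
    if abs(d_ln_ev) <= ln_ev_tol:
        if cb_b - cb_a > cb_tol:
            return "case 2: extra parameters measured but not needed, prefer A"
        return "case 3: data cannot measure the extra parameters, inconclusive"
    return "A favored: B has a clearly lower model likelihood"

# Hypothetical inputs for each branch.
example_1 = compare_models(ln_ev_a=-20.0, ln_ev_b=-12.0, cb_a=4.0, cb_b=5.0)
example_2 = compare_models(ln_ev_a=-12.0, ln_ev_b=-12.3, cb_a=4.0, cb_b=5.0)
example_3 = compare_models(ln_ev_a=-12.0, ln_ev_b=-12.3, cb_a=4.0, cb_b=4.1)
```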

We illustrate these cases by computing the model likelihood
and effective complexity of a toy model in the next
section.

D. An illustrative example

As a specific example of the linear model described in
section III A we consider the classic problem of fitting
data drawn from a polynomial of unknown degree. The
models that we test against the data are a collection of
polynomials of increasing order, with input complexity
$C_0 = n$, where $n$ is the order of the polynomial. The
question is then whether our model selection can correctly
recover the order $m$ of the polynomial from which
the data are actually drawn.

The data covariance matrix is taken to be diagonal
and with a common standard deviation for all points,
$s$, while the prior over the polynomial coefficients is a
multivariate Gaussian with covariance matrix given by
the identity matrix. For definiteness, we will take the
underlying model from which the data are generated to
have $m = 6$ parameters. First we draw $p = 10$ data
points with noise $s = 1/200$. We plot in Figure 1 the
resulting model likelihood and effective complexity as a
function of the input model complexity, $n$. The likelihood
of the models increases rapidly until $n = m$ and then
decreases slowly, signaling that $n > 6$ parameters are not
justified. (We plot the logarithm of the logarithm of the
model likelihood as the models with $n < 6$ parameters
are highly disfavored and would otherwise not fit onto
the figure - an example of case (1) of the list in the
previous section.) The effective complexity on the other
hand continues to grow until $n \approx p$, at which point the
data is unable to constrain more complex models and
$C_b$ becomes constant. We conclude that the model with
$n = 6$ is the one preferred by data and that additional
parameters are not needed, although the data could have
supported them. This is case (2) in the list of the previous
section.
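
A toy experiment of this kind can be sketched as follows. The text does not specify where the polynomial is sampled or which coefficients were used, so the sampling points on $[-1, 1]$, the monomial basis and the unit true coefficients are assumptions of this sketch; the numbers it produces are therefore not those of Figure 1. With a unit Gaussian prior on the coefficients, marginalizing them analytically gives $y \sim N(0, s^2 \mathbb{1} + \Phi \Phi^t)$, where $\Phi$ is the design matrix.

```python
import numpy as np

def toy_model(x, y, s, n):
    """Return (C_b, ln model likelihood) for a polynomial with n coefficients.

    Basis 1, x, ..., x^(n-1); unit Gaussian prior on the coefficients;
    independent Gaussian noise of standard deviation s on every point.
    """
    Phi = np.vander(x, n, increasing=True)           # design matrix F
    sv = np.linalg.svd(Phi / s, compute_uv=False)    # sv^2 = eigenvalues 1/sigma_i^2 of L
    c_b = float(np.sum(sv ** 2 / (1.0 + sv ** 2)))   # same as sum_i 1 / (1 + sigma_i^2)
    # Marginalising the coefficients gives y ~ N(0, s^2 1 + Phi Phi^t).
    cov = s ** 2 * np.eye(len(x)) + Phi @ Phi.T
    _, logdet = np.linalg.slogdet(cov)
    ln_ev = -0.5 * (y @ np.linalg.solve(cov, y) + logdet + len(x) * np.log(2 * np.pi))
    return c_b, float(ln_ev)

rng = np.random.default_rng(3)
p, m, s = 10, 6, 1.0 / 200.0
x = np.linspace(-1.0, 1.0, p)                        # assumed sampling points
coeffs = np.ones(m)                                  # assumed true coefficients
y = np.vander(x, m, increasing=True) @ coeffs + s * rng.normal(size=p)

results = {n: toy_model(x, y, s, n) for n in range(1, 13)}
for n, (c_b, ln_ev) in results.items():
    print(f"n = {n:2d}   C_b = {c_b:6.3f}   ln p(d|M) = {ln_ev:10.2f}")
```

Changing `p` or `s` in this sketch mimics a degradation of the data: the complexity can never exceed the number of data points, whatever the number of input parameters.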

Next we decrease the number of data points to $p = 4$,
in which case we obviously cannot recover more than four
parameters. It comes as no surprise that the model likelihood
stops increasing at $n = 4$, see Figure 2. But the
effective complexity also flattens at $n = 4$, which means
that the data cannot deal with more than four parameters,
irrespective of the underlying model! In this case,
corresponding to point (3) of the list in the previous section,
we conclude that the available data do not support
more than 4 effective parameters. On the other hand,
we recognize that the flattening of the model likelihood
at $n = 4$ does not necessarily mean that the underlying
model has four parameters. We thus must hold our
judgment until better data become available.

As an alternative to decreasing the number of data 
points, we can achieve a similar data degradation effect 
by keeping p = 10 data points but by increasing the noise 
to $s = 1$. We obtain a result similar to the previous case,
which is plotted in Figure 3.

We conclude by emphasizing once more that in general
the outcome of model selection based on assessing the
model likelihood and effective complexity depends on the
interplay of two factors. The first is the predictive power
of the model, as encoded in the prior. The second is the
constraining power of the data.

FIG. 1: Bayesian effective complexity $C_b$ (solid black line,
left-hand vertical scale) and model likelihood (red circles,
right-hand scale) as a function of the number of parameters,
for $d = 10$ data points with small noise. The dashed blue line
is the number of parameters for reference. The errorbars on
the model likelihood values are smaller than the symbols on
this scale, while the Bayesian complexity is independent of
the noise realization (i.e., error-free) for linear models. The
Bayesian analysis correctly concludes that the best model is
the one with n = 6 parameters.



IV. HOW MANY PARAMETERS DOES THE 
CMB NEED? 

We now apply the above tools to the question of how 
many cosmological parameters are necessary to describe 
current cosmic microwave background (CMB) anisotropy 
measurements. We make use of the following CMB data: 
WMAP, ACBAR, CBI, VSA and Boomerang 2003. To 
provide an additional regularization (especially in view of 
including spatial curvature) we also use the HST limits 
on the Hubble parameter, $H_0 = 72 \pm 8$ km/s/Mpc. We
should note that this strongly increases the power of the 
CMB data. We use both the first-year WMAP alone 
(WMAP1) as well as the data for the first three years 
(WMAP3). 

FIG. 2: As in Figure 1 but now using only $p = 4$ data points.
The maximum effective complexity that the data can support
is $C_b = 4$, and the flattening of the model likelihood at the
same value does not allow one to conclude that models with more
parameters are disfavored.

FIG. 3: As in Figure 1 but now with $p = 10$ data points
and large noise. As in Figure 2, the maximum complexity
supported by the data is smaller than the underlying true
model complexity, $m = 6$.

For each set of cosmological parameters we create a
converged MCMC chain using the publicly available code
cosmoMC. We then compute the Bayesian complexity
from the chain through Eq. (9). The model likelihood
is evaluated with the Savage-Dickey method (see [?] and
references therein): for a model that is nested within
a larger model by fixing one parameter, $\theta_f$, to a value
$\theta_0$, the Bayes factor between the two models is given by
the posterior of the larger model at $\theta_f = \theta_0$ (normalized
and marginalized over all other parameters) divided
by the prior at this point. In this way it is possible to
derive all the model likelihoods in a hierarchy, starting
from the most complex model (which is assigned an arbitrary
model likelihood, in our case 1). Since errors tend
to accumulate through the intermediate steps necessary
to reach the simpler models, and as a cross-check, we
additionally computed the model likelihoods with nested
sampling [?] for the WMAP1 data. Within the errorbars,
we did not find any appreciable discrepancy between the
two methods.
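
The Savage-Dickey ratio can be verified exactly in a conjugate Gaussian toy problem, which is what the sketch below does; it is not the cosmological computation, where the marginal posterior has to be estimated from the chain. The data vector, the noise level `s` and the prior width `w` are hypothetical. The nested model fixes the single parameter to zero, and the larger model gives it a Gaussian prior.

```python
import numpy as np

def ln_gauss(x, mean, var):
    """Log of a one-dimensional Gaussian pdf."""
    return -0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def ln_mvn_zero_mean(y, cov):
    """Log of a zero-mean multivariate Gaussian pdf evaluated at y."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (y @ np.linalg.solve(cov, y) + logdet + len(y) * np.log(2 * np.pi))

# Hypothetical setup: y_i = theta + noise with noise std s. The nested model
# fixes theta = 0; the larger model has a Gaussian prior of std w on theta.
y = np.array([0.3, -0.1, 0.4, 0.2, 0.0])
s, w = 0.5, 2.0
N = len(y)

# Savage-Dickey: posterior of the larger model at theta = 0 divided by the prior there.
post_var = 1.0 / (N / s ** 2 + 1.0 / w ** 2)
post_mean = post_var * y.sum() / s ** 2
ln_b_sd = ln_gauss(0.0, post_mean, post_var) - ln_gauss(0.0, 0.0, w ** 2)

# Direct ratio of the two model likelihoods, for comparison.
ln_ev_nested = ln_mvn_zero_mean(y, s ** 2 * np.eye(N))
ln_ev_larger = ln_mvn_zero_mean(y, s ** 2 * np.eye(N) + w ** 2 * np.ones((N, N)))
ln_b_direct = ln_ev_nested - ln_ev_larger
```

Both routes give the same log Bayes factor of the nested model against the larger one, which is the identity the method relies on.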

In this analysis we use four basic cosmological parameters,
namely

$$ \{\Omega_b h^2, \Omega_m h^2, \theta_*, A_s\}, \tag{21} $$

where $\Omega_b$ ($\Omega_m$) is the baryon (matter) density relative
to the critical energy density, $h = H_0/(100$ km/s/Mpc$)$
is the fudge factor, $\theta_*$ is the ratio of the sound horizon
to the last scattering angular diameter distance and
$A_s = \ln P(k_0)$, with $P(k_0)$ the power spectrum of adiabatic
density fluctuations at a scale $k_0 = 0.05$ Mpc$^{-1}$.
We then add three more parameters in various combinations,
to study whether they are necessary and supported
by the observations. The additional parameters are: the
reionization optical depth $\tau$, the scalar spectral index $n_s$
and the spatial curvature (parameterized by its contribution
to the Hubble equation, $\Omega_K$).

The scalar spectral index is either fixed to $n_s = 1$ (the
case of a scale invariant power spectrum of initial fluctuations),
or else chosen with a Gaussian prior, $n_s = 1 \pm 0.1$.
We find that $\pi(n_s = 1) = 1/(0.1\sqrt{2\pi}) \approx 4$. We choose
the prior of the reionization optical depth to reflect our
current lack of understanding of how reionization proceeds.
We choose it flat within $0 < \tau < 0.15$ and then
add an exponential falloff:

$$ \pi(\tau) \propto \exp\left( -\frac{\tau - 0.15}{0.05} \right) \quad \text{for } \tau > 0.15. \tag{22} $$

The models without this parameter have $\tau = 0$, and
$\pi(\tau = 0) = 5$. The curvature contribution is either set
to zero, $\Omega_K = 0$, if we do not include the parameter, or
else is used with a flat prior $-1 < \Omega_K < 1$. The value of
the prior for a flat universe is $\pi(\Omega_K = 0) = 1/2$. We find
that adding curvature as a free parameter when using
the WMAP 3yr data leads to a non-Gaussian posterior
for which the mean as a point-estimate for the KL divergence
in Eq. (6) is not representative. We opt here for a
slightly modified estimator, given by the average of the
$\chi^2$ evaluated at the mean and the mode of the posterior.
For a Gaussian posterior this reduces to the mean point-estimator
but it appears to be somewhat more stable.
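
The three prior values quoted above (about 4 for $n_s = 1$, 5 for $\tau = 0$ and 1/2 for zero curvature) follow from normalizing each prior to unit integral. A short check, with the $\tau$ prior normalized numerically on a grid whose extent and resolution are arbitrary choices of this sketch:

```python
import numpy as np

# Gaussian prior n_s = 1 +/- 0.1 evaluated at its centre.
prior_ns = 1.0 / (0.1 * np.sqrt(2.0 * np.pi))

def tau_shape(tau):
    """Unnormalised tau prior: flat up to 0.15, then an exponential falloff of scale 0.05."""
    tau = np.asarray(tau, dtype=float)
    return np.where(tau <= 0.15, 1.0, np.exp(-(tau - 0.15) / 0.05))

grid = np.linspace(0.0, 3.0, 300001)
f = tau_shape(grid)
norm = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))   # trapezoidal rule
prior_tau0 = float(tau_shape(0.0) / norm)                # analytically 1 / (0.15 + 0.05)

# Flat prior on -1 < Omega_K < 1.
prior_omk = 1.0 / (1.0 - (-1.0))
```

These are the prior densities that enter the denominator of the Savage-Dickey ratio when a parameter is fixed to $n_s = 1$, $\tau = 0$ or $\Omega_K = 0$.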

We quote our results in Table I (WMAP3) and Table II
(WMAP1), while Figure 4 gives a graphical representation.
The model likelihoods are quoted relative to
the model with the most parameters. We find that using
WMAP1 we can measure all the parameters for the
models with four and five parameters. For $C_0 > 5$, the
effective complexity increases more slowly than the number
of input parameters, but we can still measure at least
six parameters with CMB+HST. With WMAP3 we can
measure all six parameters of the $b_4 + n_s + \tau$ model. We
conclude that the new WMAP3 data augmented by the
HST determination of $H_0$ can measure all seven parameters
considered in this analysis.

Model                            Model likelihood         C_0   Effective complexity   Comments
$b_4 + n_s + \Omega_K + \tau$    1                        7     6.9 ± 0.3              Too many parameters
$b_4 + n_s + \Omega_K$           0.035 ± 0.005            6     6.3 ± 0.2              $\Omega_K$ disfavored
$b_4 + n_s + \tau$               48 ± 2                   6     6.0 ± 0.04             $n_s + \tau$ favored
$b_4 + \Omega_K + \tau$          0.04 ± 0.01              6     6.4 ± 0.3              $\Omega_K$ disfavored
$b_4 + n_s$                      2.2 ± 0.3                5     5.0 ± 0.04             $n_s$ necessary
$b_4 + \Omega_K$                 (1.5 ± 0.5) × 10^{-5}    5     4.8 ± 0.04             $\Omega_K$ disfavored
$b_4 + \tau$                     3.5 ± 1                  5     4.9 ± 0.04             $\tau$ necessary
$b_4$                            (1.5 ± 0.5) × 10^{-3}    4     4.0 ± 0.04             Strongly disfavored

TABLE I: Relative model likelihood (normalized to the model with the most parameters) with WMAP
3yr data and effective Bayesian complexity for the models discussed in the text. $C_0$ gives the number of
parameters of the model. The error on the effective complexity was computed from random sub-chains
and represents only the statistical error.

Model                            Model likelihood         Effective complexity
$b_4 + n_s + \Omega_K + \tau$    1                        6.2 ± 0.1
$b_4 + n_s + \Omega_K$           0.06 ± 0.01              6.0 ± 0.3
$b_4 + n_s + \tau$               42 ± 5                   5.5 ± 0.02
$b_4 + \Omega_K + \tau$          0.68 ± 0.15              5.6 ± 0.2
$b_4 + n_s$                      4.5 ± 0.9                4.9 ± 0.03
$b_4 + \Omega_K$                 (1.5 ± 0.5) × 10^{-4}    5.0 ± 0.09
$b_4 + \tau$                     17 ± 5                   4.8 ± 0.03
$b_4$                            (2.3 ± 0.6) × 10^{-3}    4.0 ± 0.05

TABLE II: Model likelihood and complexity as inferred from the first year WMAP data, for comparison
with the values in Table I. We see how the WMAP3 data increases the model likelihood of $b_4 + n_s + \tau$
relative to the simpler models $b_4 + \tau$ and $b_4 + n_s$.

Taking into account the model likelihood values, we
find that the models $b_4$, $b_4 + \Omega_K$ and (to a lesser extent)
$b_4 + n_s + \Omega_K$ as well as $b_4 + \tau + \Omega_K$ are strongly disfavoured.
This shows that $\Omega_K = 0$ is preferred by current data, in
agreement with the result of Ref. [2]. In general, adding
in a non-zero spatial curvature leads to a well measurable
decrease in the model probability that, together with the
increase in effective model complexity, reinforces our belief
that $\Omega_K$ can be safely neglected for the time being.
Of course this result is partially a reflection of our choice
of prior on $\Omega_K$. However, it is important to remember
that had we halved the range of this prior, the likelihoods
for models with non-zero curvature would have
only doubled. This would not change the results significantly.
Alternatively, an inflation-motivated prior of the
type $|\Omega_K| \ll 1$ would render the parameter unmeasured
and irrelevant. In this case adding it would not change
the effective complexity or the model likelihood at all.

We also find that the basic set $b_4$ must be augmented
by either $n_s$ or $\tau$. The inclusion of both parameters at
the same time was optional with the first year WMAP
data only, but using WMAP3 we find that $b_4 + n_s + \tau$
has a significantly higher model likelihood than all other
models investigated here. Also, where with WMAP1 we
only gained half an effective parameter when going from
$b_4 + \tau$ or $b_4 + n_s$ to $b_4 + n_s + \tau$, we now gain one full
parameter. Thus we can now measure both parameters
at the same time.
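
The comparison just described can be read off the tabulated central values with a few lines of arithmetic. The sketch below transcribes the model likelihoods and complexities of the three relevant models from Tables I and II, ignoring the quoted errors; the dictionary layout and the function name are choices made for this illustration.

```python
# Central values transcribed from Table I (WMAP3) and Table II (WMAP1); errors omitted.
likelihood = {
    "WMAP3": {"b4+ns+tau": 48.0, "b4+ns": 2.2, "b4+tau": 3.5},
    "WMAP1": {"b4+ns+tau": 42.0, "b4+ns": 4.5, "b4+tau": 17.0},
}
complexity = {
    "WMAP3": {"b4+ns+tau": 6.0, "b4+ns": 5.0, "b4+tau": 4.9},
    "WMAP1": {"b4+ns+tau": 5.5, "b4+ns": 4.9, "b4+tau": 4.8},
}

def summarize(data):
    """Bayes factor and complexity gain of b4+ns+tau relative to each simpler model."""
    full = "b4+ns+tau"
    out = {}
    for simpler in ("b4+ns", "b4+tau"):
        bayes_factor = likelihood[data][full] / likelihood[data][simpler]
        gain = complexity[data][full] - complexity[data][simpler]
        out[simpler] = (bayes_factor, gain)
    return out

summary = {data: summarize(data) for data in likelihood}
for data, rows in summary.items():
    for simpler, (bf, gain) in rows.items():
        print(f"{data}: b4+ns+tau vs {simpler}: Bayes factor {bf:.1f}, complexity gain {gain:.1f}")
```

With these central values the Bayes factors in favor of the six-parameter model grow from WMAP1 to WMAP3, and the complexity gain rises from roughly 0.6 to 0.7 to about one full parameter.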

Overall, we conclude that $b_4 + \tau$ was a good and sufficient
base model until a few months ago. Now $b_4 + n_s + \tau$
should be used. A wider prior on $n_s$ would have only a
minimal impact on the complexity of models including
a tilt, since $n_s$ seems to be rather well-measured when
considered alone (i.e., not in combination with $\tau$). Inclusion
of a non-vanishing curvature is discouraged by
Bayesian model comparison. We find that a model with
6 parameters is sufficient to explain the current CMB
data, even though all seven effective parameters can be
constrained now. This analysis demonstrates that the
Bayesian complexity estimator (9) works with real-world
data and gives useful additional information for model
comparison.



V. CONCLUSIONS 

In this work we introduced the Bayesian complexity 
as a measure of the effective number of parameters in 
a model. We discussed extensively its properties and 
its usefulness in the context of linear models, where it 
can be computed analytically. We showed that it corre- 
sponds to the number of parameters for which the width 
of the posterior probability distribution is significantly 
narrower than the width of the prior probability distribution.
These parameters can be considered to have been
well measured by the data given our prior assumptions
in the model.

FIG. 4: We plot the model likelihood (normalized to the
model with the most parameters) versus the Bayesian effective
complexity for the models listed in the table. A downward-pointing
arrow indicates the Bayesian complexity of models that lie
outside the boundary of the figure.

We also showed that for linear models the Bayesian 
complexity probes the trace of the posterior covariance 
matrix, while the model likelihood is sensitive to the
determinant. We argued that the Bayesian complexity
allows us to test for cases where the data is not informative
enough for the model likelihood to be a reliable indicator
of model performance, and provided an explicit example.

Finally, we applied the combination of model likeli- 
hood and Bayesian complexity to the question of how 
many (and which) parameters are measured by current 
CMB data, complemented by the HST limit on the Hubble
parameter. We limited ourselves to the family of
$\Lambda$CDM models with a power-law spectrum of primordial
perturbations. We demonstrated that, in addition
to the energy density in baryons and matter, the CMB
peak location parameter $\Theta_\star$ and the amplitude of the
initial perturbations, we now need to consider both the
reionisation optical depth and the scalar spectral index.
Non-flat models are disfavoured. The effective complexity
shows that the CMB data can measure all seven
parameters in this scheme.

As the Bayesian complexity is very easy to compute
from an MCMC chain, we hope that it will be used routinely
in future data analyses in conjunction with the
model likelihood for model building assessment. It will 
help to determine if the data is informative enough to 
measure the parameters under consideration. Further 
work is needed to study the performance of the Bayesian 
complexity in situations with a strongly non-Gaussian 
posterior. 
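
As an illustration of how the complexity might be estimated from a chain, the following is a minimal sketch. It assumes equal-weight posterior samples and uses the posterior mean as the point estimator, so that the complexity is the posterior average of the chi-square minus the chi-square at the posterior mean. The function name `bayesian_complexity`, the toy Fisher matrices and the synthetic Gaussian "chain" are hypothetical choices made for this example only. A real MCMC chain with sample weights would require weighted averages.

```python
import numpy as np

def bayesian_complexity(samples, chi2):
    """Estimate C_b = <chi2(theta)> - chi2(<theta>) from equal-weight samples.

    samples: array of shape (n_samples, n_params) drawn from the posterior.
    chi2:    callable mapping a parameter vector to -2 ln(likelihood),
             up to an additive constant.
    """
    samples = np.asarray(samples, dtype=float)
    mean_chi2 = np.mean([chi2(theta) for theta in samples])
    chi2_at_mean = chi2(samples.mean(axis=0))
    return mean_chi2 - chi2_at_mean

# Toy linear-Gaussian setup (all values assumed for illustration):
# the first parameter is well measured, the second is prior dominated.
rng = np.random.default_rng(0)
L = np.diag([100.0, 0.01])        # likelihood Fisher matrix
P = np.eye(2)                     # prior Fisher matrix, prior centred on zero
theta0 = np.array([0.3, -0.2])    # maximum-likelihood point
F = L + P                         # posterior Fisher matrix
post_mean = np.linalg.solve(F, L @ theta0)

# Stand-in for an MCMC chain: direct draws from the Gaussian posterior.
chain = rng.multivariate_normal(post_mean, np.linalg.inv(F), size=20000)

def chi2(theta):
    d = theta - theta0
    return d @ L @ d

c_b = bayesian_complexity(chain, chi2)
print(c_b)  # should scatter around tr(L F^-1), which is close to 1 here
```

In this toy case only one of the two parameters has a posterior much narrower than its prior, and the estimate comes out near one effective parameter.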

APPENDIX A: MODEL LIKELIHOOD AND COMPLEXITY IN THE LINEAR CASE

Here we compute first the model likelihood © for the 
linear model using the Gaussian prior (|T7Jl . An 

analogous computation can be found in -Qj. Returning 
to Bayes theorem QJ, the posterior pdf is given by a 
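multivariate Gaussian with Fisher information matrix $F$,

$$F = L + P, \tag{A1}$$

and mean $\bar\theta$ given by

$$\bar\theta = F^{-1} L \theta_0. \tag{A2}$$

The model likelihood then evaluates to

$$
\begin{aligned}
p(d|M) &= \mathcal{L}_0 \frac{|F|^{-1/2}}{|P|^{-1/2}}
  \exp\left[-\frac{1}{2}\,\theta_0^t \left(L - L F^{-1} L\right) \theta_0\right] \\
&= \mathcal{L}_0 \frac{|F|^{-1/2}}{|P|^{-1/2}}
  \exp\left[-\frac{1}{2}\left(\theta_0^t L \theta_0 - \bar\theta^t F \bar\theta\right)\right].
\end{aligned}
\tag{A3}
$$
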

This can be easily interpreted by looking at its three
components. The quality of fit of the model is encoded
in $\mathcal{L}_0$, which represents the best-fit likelihood; a
model that fits the data better will be favored by this
term. The term involving the determinants of $P$ and $F$
is a volume factor (the so-called Occam's factor). As
$|P| \leq |F|$, it penalizes models with a large volume of
wasted parameter space, i.e. those for which the parameter
space volume $|F|^{-1/2}$ that survives after arrival of
the data is much smaller than the initially available
parameter space under the model prior, $|P|^{-1/2}$. Finally,
the exponential term suppresses the likelihood of models
for which the parameter values that maximize the
likelihood, $\theta_0$, differ appreciably from the expectation value
under the posterior, $\bar\theta$. Therefore, when we consider a
model with an increased number of parameters, its
model likelihood will be larger only if the quality of fit
increases enough to offset the penalizing effect of
the Occam's factor.
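
The three components can be made concrete numerically. The sketch below assumes a Gaussian likelihood with Fisher matrix `L` peaked at `theta0` and a zero-centred Gaussian prior with Fisher matrix `P`, and evaluates the logarithm of the model likelihood as the sum of best-fit term, Occam's factor and exponential term. The function name `log_evidence_linear` and the numerical values are assumptions made for illustration. A brute-force one-dimensional integral is included as a cross-check.

```python
import numpy as np

def log_evidence_linear(L, P, theta0, ln_like_max=0.0):
    """ln p(d|M) for a Gaussian likelihood (Fisher matrix L, peak theta0)
    and a zero-centred Gaussian prior with Fisher matrix P."""
    F = L + P
    _, logdet_F = np.linalg.slogdet(F)
    _, logdet_P = np.linalg.slogdet(P)
    # Occam's factor: ln(|F|^{-1/2} / |P|^{-1/2}), never positive.
    occam = 0.5 * (logdet_P - logdet_F)
    # Exponential term: -1/2 theta0^t (L - L F^{-1} L) theta0.
    shift = -0.5 * theta0 @ (L - L @ np.linalg.solve(F, L)) @ theta0
    return ln_like_max + occam + shift

# One-parameter example with assumed values.
L1 = np.array([[25.0]])
P1 = np.array([[1.0]])
t0 = np.array([0.4])
analytic = log_evidence_linear(L1, P1, t0)

# Brute-force check: integrate likelihood times normalized prior on a grid.
grid = np.linspace(-10.0, 10.0, 200001)
dx = grid[1] - grid[0]
prior = np.sqrt(P1[0, 0] / (2.0 * np.pi)) * np.exp(-0.5 * P1[0, 0] * grid**2)
like = np.exp(-0.5 * L1[0, 0] * (grid - t0[0])**2)
numeric = np.log(np.sum(prior * like) * dx)

print(analytic, numeric)
```

Adding a parameter that the data do not constrain (a zero row and column in `L`) leaves both the Occam's factor and the exponential term unchanged in this sketch, whatever the prior on that parameter.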

Let us now turn to the computation of the Bayesian 
complexity, Eq. ©. Using the posterior mean (denoted 
by an overbar) as a point estimator for the effective chi-square,
we obtain from Eq. (16)

$$\chi^2(\bar\theta) = \chi^2(\theta_0) + (\bar\theta - \theta_0)^t L (\bar\theta - \theta_0). \tag{A4}$$

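The expectation value of the $\chi^2$ under the posterior is
given by

$$\overline{\chi^2(\theta)} = \chi^2(\theta_0) + \mathrm{tr}\left(L \left\langle (\theta - \theta_0)(\theta - \theta_0)^t \right\rangle\right). \tag{A5}$$

We concentrate on the second term and write $(\theta - \theta_0) =
(\theta - \bar\theta) + (\bar\theta - \theta_0) \equiv u + v$. The expectation
value inside the trace then becomes $\langle u u^t + u v^t + v u^t + v v^t \rangle$.
The first term of this expression is just the posterior covariance matrix,
$\langle u u^t \rangle = F^{-1}$. The last term combines with $\chi^2(\theta_0)$ to
give $\chi^2(\bar\theta)$. The cross terms vanish since $\langle u \rangle = 0$.

Taken all together, we obtain for the complexity

\begin{align}
\mathcal{C}_b &= \mathrm{tr}\left(L \left\langle (\theta - \bar\theta)(\theta - \bar\theta)^t \right\rangle\right) \tag{A6} \\
&= \mathrm{tr}\left(L F^{-1}\right). \tag{A7}
\end{align}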



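Using the relation (A1) we can rewrite the complexity as

\begin{align}
\mathcal{C}_b &= \mathrm{tr}\left\{(F - P) F^{-1}\right\} \tag{A8} \\
&= \mathcal{C}_0 - \mathrm{tr}\left\{P F^{-1}\right\}, \tag{A9}
\end{align}

where $\mathcal{C}_0 = \mathrm{tr}(F F^{-1})$ is the number of parameters of the model.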

Thus while the model likelihood depends on the determi- 
nant of the Fisher matrices, the complexity depends on 
their trace. Another important point worth highlighting 
is that for linear models the complexity does not depend 
on the degree of overlap between the prior and the pos- 
terior, nor on the noise realization (as long as the noise 
covariance matrix is known). 
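
The two trace expressions for the complexity can be checked numerically. The sketch below assumes diagonal likelihood and prior Fisher matrices built from illustrative widths. The function names and numbers are hypothetical. In the diagonal case each parameter contributes $1/(1 + \sigma_i^2/\Sigma_i^2)$, where $\sigma_i$ is the likelihood width and $\Sigma_i$ the prior width, so a parameter counts fully only when the data constrain it much better than the prior does.

```python
import numpy as np

def complexity_trace(L, P):
    """C_b = tr(L F^{-1}) with F = L + P."""
    F = L + P
    return np.trace(L @ np.linalg.inv(F))

def complexity_from_prior(L, P):
    """Equivalent form C_b = C_0 - tr(P F^{-1}), C_0 = number of parameters."""
    F = L + P
    return L.shape[0] - np.trace(P @ np.linalg.inv(F))

# Assumed widths: two well-measured parameters and one prior-dominated one.
sigma_like = np.array([0.01, 0.1, 10.0])   # likelihood widths
sigma_prior = np.array([1.0, 1.0, 1.0])    # prior widths
L = np.diag(1.0 / sigma_like**2)
P = np.diag(1.0 / sigma_prior**2)

c_trace = complexity_trace(L, P)
c_prior = complexity_from_prior(L, P)

# Per-parameter contributions in the diagonal case.
per_param = 1.0 / (1.0 + (sigma_like / sigma_prior)**2)

print(c_trace, c_prior, per_param)
```

Neither function takes the maximum-likelihood point as an argument, which reflects the statement above that for linear models the complexity is independent of the noise realization.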